26 March 2019

Overview

  • Why bother?
  • What is Data Science?
  • What are Data Scientists?
  • What is Machine Learning?
  • What types of problems can Machine Learning solve?
  • AI ≠ Machine Learning?
  • How does Machine Learning work?

Why bother?

Google searches for "Machine Learning"

[source: trends.google.com]

What is Data Science?

What is Data Science?

Data Science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from data in various forms, both structured and unstructured.

[Source: Wikipedia]

What is Data Science?

The same definition appears, essentially word for word, on many other sites:

[sources: gartner.com, fscj.edu, bdbizviz.com, wikipedia.org, carestruck.org]

What are Data Scientists?

Stats on Data Scientists

[www.oreilly.com/data/free/2017-data-science-salary-survey.csp]

What industries do Data Scientists work in?

How do Data Scientists spend their time?

What tasks do Data Scientists work on?

What education do Data Scientists have?

What do Data Scientists earn?

What is Machine Learning (ML)?

Machine Learning – Definition

A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E.

[Tom M. Mitchell]


→ Template to describe complex problems with less ambiguity:

  • Experience E = Data required
  • Task T = Problem (class) to solve
  • Measure P = Metric to evaluate results

Example – Detect spam emails

  • Experience E = Data set of emails with examples of spam and ham
  • Task T = Classify an email as either spam or ham
  • Measure P = Accuracy of correctly classified emails

→ Preparing a decision-making program to solve this task is called training

→ Collected email examples are called the training set

→ The program is referred to as a model
(as in a model of the problem of distinguishing spam from non-spam)
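The training/model terminology above can be sketched in a few lines of Python. The toy messages and the word-count "model" below are illustrative assumptions, not a real spam filter:

```python
from collections import Counter

# Hypothetical toy experience E: (text, label) pairs — not real emails
train = [
    ("win money now", "spam"),
    ("cheap money offer", "spam"),
    ("meeting agenda attached", "ham"),
    ("lunch tomorrow", "ham"),
]

# "Training": count how often each word occurs per class
word_counts = {"spam": Counter(), "ham": Counter()}
for text, label in train:
    word_counts[label].update(text.split())

def classify(text):
    """The learned 'model' for task T: pick the class whose training
    words overlap most with the message (measure P would be accuracy)."""
    scores = {label: sum(counts[w] for w in text.split())
              for label, counts in word_counts.items()}
    return max(scores, key=scores.get)
```

Here `classify("cheap money")` returns `"spam"`: the message shares words only with the spam examples in the training set.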

Types of Machine Learning

Machine Learning tasks T are typically classified into two broad categories:

Supervised learning:

The computer is presented with example inputs and their desired outputs, given by a "teacher", and the goal is to learn a general rule that maps inputs to outputs.

Unsupervised learning:

No labels are given to the learning algorithm, leaving it on its own to find structure in its input. Unsupervised learning can be a goal in itself (discovering hidden patterns in data) or a means towards an end (feature learning).

AI ≠ Machine Learning?

AI is the study of "intelligent agents": any device that perceives its environment and takes actions that maximize its chance of successfully achieving its goals.

[Poole, Mackworth, Goebel (1998). Computational Intelligence: A Logical Approach]

Machine learning is a subset of artificial intelligence that is concerned with the construction and study of systems that can learn from data.

What types of problems can ML solve?

What types of problems can ML solve?

  • Classification
  • Regression
  • Association rules
  • Clustering
  • Recommending

Classification

  • Given a set of records, \(X = \{x_1 , \dots , x_n \}\)
  • Each record \(x_i = \{x_{i_1} , \dots, x_{i_m} \}\) is a set of \(m\) attributes
  • Each record has additional attribute \(l_i \in L\), with \(L\) being finite set of class labels
  • Find a function \(f\) such that \(f (x_i) \approx l_i\)
  • Task: Compute \(l_j\) for previously unseen records \(x_j\) as accurately as possible
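A minimal sketch of such a function \(f\), assuming numeric attributes and a 1-nearest-neighbour rule (one of the simplest possible classifiers; the records are invented for illustration):

```python
import math

# Invented labeled records: m = 2 numeric attributes plus a label l_i
X = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"),
     ((5.0, 5.2), "B"), ((4.8, 5.0), "B")]

def f(x):
    """1-nearest-neighbour rule: label x like its closest known record
    (Euclidean distance)."""
    nearest = min(X, key=lambda record: math.dist(x, record[0]))
    return nearest[1]
```

An unseen record such as `(1.1, 0.9)` lies close to the "A" records, so `f((1.1, 0.9))` returns `"A"`.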

Classification – Direct marketing example

  • Goal: Reduce cost of mailing by targeting a set of consumers likely to buy a product
  • Approach:
    • Use historic data of the same or a similar product
    • We know which customers decided (not) to buy
    • {buy, don’t buy} are labels to learn
    • Collect various demographic, lifestyle, transaction information about customers (type of business, where they live, how much they earn, etc.)
    • Use this information as input attributes to learn a classifier model

Classification – Fraud detection

  • Goal: Predict fraudulent credit card transactions
  • Approach:
    • Use credit card transactions and information on the account holder as attributes (when does a customer buy, what do they buy, how often do they pay on time, etc.)
    • Label past transactions as fraudulent or fair → these labels are the learning target
    • Learn a model for the class of the transactions
    • Use this model to detect fraud by observing credit card transactions on an account

Classification – Customer churn

  • Goal: Predict whether a customer is likely to be lost
  • Approach:
    • Use transaction records to capture customer behaviour
      (attributes around recency and frequency of service usage and transaction volumes)
    • Enrich transactions with demographics and customer specific data
    • Define "churn" and label customers
    • Find a model for churners
    • Start predicting

Regression

  • Given a set of records, \(X = \{x_1 , \dots , x_n \}\)
  • Each record \(x_i = \{x_{i_1} , \dots, x_{i_m} \}\) is a set of \(m\) attributes
  • Each record has additional attribute \(y_i \in \mathbb{R}\) (prediction target)
  • Find a function \(f\) such that \(f (x_i) \approx y_i\)
  • Task: Compute \(y_j\) for previously unseen records \(x_j\) as accurately as possible
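As a sketch, here is a one-attribute least-squares fit of \(f(x) = ax + b\) in plain Python; the data points are invented and lie roughly on \(y = 2x\):

```python
# Invented training records with a single attribute x and target y
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.1, 8.0]

# Closed-form least-squares solution for slope a and intercept b
n = len(xs)
mx = sum(xs) / n
my = sum(ys) / n
a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
     / sum((x - mx) ** 2 for x in xs))
b = my - a * mx

def predict(x):
    """The learned regression function f(x) = a*x + b."""
    return a * x + b
```

The fitted slope comes out close to 2, so `predict(5.0)` is close to 10 for this toy data.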

Regression – Predicting house prices

  • Goal: Predict house price
  • Approach:
    • Collect property data (characteristics, year built, location, school zones, sales season)
    • Link property with sales history
    • Train a model for the sales price
    • Use model to predict sales prices of unseen houses

Regression – Modeling salaries

[source: www.oreilly.com/data/free/2017-data-science-salary-survey.csp]

Association rules

  • Ideas come from market basket analysis
  • What products are frequently bought together?
  • How should shelves be managed?
  • What impact does the discontinuation of a product have on the sales of other products?

Goal: Find frequent/interesting patterns, associations, correlations among sets of items in a transactional database

Association Rules – Marketing & sales promotion

  • Let the rule discovered be:
    {Bagels} → {Potato chips}
  • Potato chips as consequent
    → Determine what should be done to boost its sales
  • Bagels in the antecedent
    → Which products would be affected if the store discontinues selling bagels

Association Rules – Shelf management

  • Goal: Identify items that are bought together by many customers
  • Approach: Process the collected point-of-sale data to find groups of items frequently bought together
  • Classic rule in literature:
    If a customer buys diapers and milk, then they are very likely to also buy beer
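The support and confidence of such a rule can be computed directly from the transactions; the baskets below are hypothetical:

```python
# Hypothetical point-of-sale transactions (market baskets)
baskets = [
    {"diapers", "milk", "beer"},
    {"diapers", "milk", "beer", "bread"},
    {"diapers", "milk"},
    {"bread", "butter"},
    {"beer", "chips"},
]

def support(itemset):
    """Fraction of baskets that contain every item of the itemset."""
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(antecedent, consequent):
    """Support of the full rule divided by support of the antecedent."""
    return support(antecedent | consequent) / support(antecedent)

conf = confidence({"diapers", "milk"}, {"beer"})
```

For these baskets the rule {diapers, milk} → {beer} has support 2/5 and confidence 2/3: of the three baskets containing diapers and milk, two also contain beer.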

Clustering

  • Task: Given a set of data points and a similarity measure among them, find clusters such that
    • Data points in the same cluster are more similar to one another
    • Data points in separate clusters are less similar to one another
  • Similarity Measures:
    • Euclidean distance (continuous attributes)
    • Problem-specific measures (sounds)
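A minimal k-means sketch using Euclidean distance as the similarity measure; the points and starting centroids are illustrative assumptions:

```python
import math

# Illustrative 2-D points forming two visible groups
points = [(1.0, 1.0), (1.5, 1.2), (5.0, 5.0), (5.2, 4.8)]

def kmeans(points, centroids, steps=10):
    """Minimal k-means: assign each point to the nearest centroid
    (Euclidean distance), then move each centroid to its cluster mean."""
    for _ in range(steps):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [tuple(sum(coord) / len(pts) for coord in zip(*pts))
                     if pts else c
                     for pts, c in zip(clusters, centroids)]
    return centroids, clusters

centroids, clusters = kmeans(points, [(0.0, 0.0), (6.0, 6.0)])
```

For these points the two clusters come out as the two visible groups, with centroids near (1.25, 1.1) and (5.1, 4.9).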

Clustering – Principle

Example: Assume we are given the following records

Clustering – Principle

We could group the records into 4 different clusters…

Clustering – Principle

… or into 5 clusters

Clustering – Market segmentation

  • Goal: Subdivide a market into distinct subsets of customers where any subset may conceivably be selected as a market target
  • Approach:
    • Collect different attributes of customers based on their geographical and lifestyle related information
    • Find clusters of "similar" customers
    • Measure the clustering quality by observing buying patterns of customers in same cluster vs. those from different clusters

Recommending

  • Estimate a utility function that predicts how a user will like an item
  • Estimation based on:
    past behaviour, relations to other users, item similarity, context

[source: towardsdatascience.com]
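As a sketch, "relations to other users" can be estimated with cosine similarity over a rating matrix; the users and ratings below are hypothetical:

```python
import math

# Hypothetical user-item rating matrix (rows: users, columns: items)
ratings = {
    "alice": [5, 4, 0, 0],
    "bob":   [5, 5, 0, 1],
    "carol": [0, 0, 5, 4],
}

def cosine(u, v):
    """Cosine similarity between two rating vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def most_similar(user):
    """Find the user with the most similar past behaviour; their
    purchases would then be recommended."""
    return max((u for u in ratings if u != user),
               key=lambda u: cosine(ratings[user], ratings[u]))

match = most_similar("alice")
```

Here `most_similar("alice")` is `"bob"`: their rating vectors point in nearly the same direction, while carol's is orthogonal.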

Recommending – Online retail products

  • Goal: Suggest products to the user of an online retail store to increase sales
  • Approach:
    • Identify similar users based on their past purchase behaviour
      → recommend their purchases
    • Recommend similar products that a user has purchased in the past
    • If purchase history not available, recommend best-selling products

How does ML work?

Lots of confusion about what ML is…

[source: xkcd.com]

Machine learning example – Decision trees

  • Decision tree learning is a method for approximating discrete-valued target functions, in which the learned function is represented as a decision tree
  • Decision tree representation:
    • Each internal node tests an attribute
    • Each branch corresponds to attribute value or range
    • Each leaf node assigns a classification

Toy example – Tennis data set

play outlook temp humidity windy
no sunny hot high false
no sunny hot high true
yes overcast hot high false
yes rainy mild high false
yes rainy cool normal false
no rainy cool normal true
yes overcast cool normal true
no sunny mild high false
yes sunny cool normal false
yes rainy mild normal false
yes sunny mild normal true
yes overcast mild high true
yes overcast hot normal false
no rainy mild high true

Decision tree – Root node "outlook"

Decision tree – Root node "humidity"

Which decision tree is more practical?

Creating decision trees

What is the best attribute to create a split?

  • Heuristic: Choose attribute producing "purest" nodes
  • Purity criterion for splitting: information gain
  • Information gain increases with average purity of subsets that an attribute produces

→ Strategy: Choose attribute that results in greatest information gain

Information gain – Entropy

  • Entropy \(H(X)\) of a random variable \(X\) with possible values \(\{x_1, \dots, x_k\}\) is defined as: \[H(X) = - \sum_{i=1}^{k}p_i \log_2 p_i \]
    with \(p_i = P(X = x_i)\)
  • Usually the logarithm to base 2 is used for entropy → the unit of entropy is the bit
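The entropy estimate can be written directly from the definition, with the \(p_i\) estimated as relative frequencies of the observed outcomes:

```python
import math

def entropy(labels):
    """H(X) = -sum_i p_i * log2(p_i), with p_i estimated as the
    relative frequency of each distinct outcome in `labels`."""
    n = len(labels)
    return -sum((labels.count(v) / n) * math.log2(labels.count(v) / n)
                for v in set(labels))
```

A fair coin, `entropy(["heads", "tails"])`, gives exactly 1.0 bit; a constant variable gives 0 bits.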

Tennis data set – Sorted by "play"

play outlook temp humidity windy
no sunny hot high false
no sunny hot high true
no rainy cool normal true
no sunny mild high false
no rainy mild high true
yes overcast hot high false
yes rainy mild high false
yes rainy cool normal false
yes overcast cool normal true
yes sunny cool normal false
yes rainy mild normal false
yes sunny mild normal true
yes overcast mild high true
yes overcast hot normal false

Information gain – Entropy of variable "play"

  • Let's assume the attribute \(\mbox{play}\) of the tennis data is a random variable with the two outcomes \(\{\text{no}, \text{yes}\}\)
  • Entropy of \(\mbox{play}\) is defined as \[H(\mbox{play}) = - (p_{\text{no}} \log_2 p_{\text{no}} + p_{\text{yes}} \log_2 p_{\text{yes}})\]
  • # of samples where \(play = no\) is 5 out of 14
    → \(p_{\text{no}} = 5/14\)
  • # of samples where \(play = yes\) is 9 out of 14
    → \(p_{\text{yes} }= 9/14\)
  • \(H(\mbox{play}) = - (5/14 \times \log_2 5/14 + 9/14 \times \log_2 9/14 ) = 0.94\)
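A quick check of this value in Python:

```python
import math

# play column of the tennis data: 5 "no" and 9 "yes" out of 14 samples
p_no, p_yes = 5 / 14, 9 / 14
h_play = -(p_no * math.log2(p_no) + p_yes * math.log2(p_yes))
```

Rounding `h_play` to two decimals gives the 0.94 bits stated above.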

Information gain – Definition

  • Let \(T\) be a set of training samples
  • Partition \(T\) into disjoint exhaustive subsets \(T_1 , \dots , T_k\) on the basis of the \(k\) values of a categorical attribute \(S\)
  • Information gain is then defined as
    \[ \mbox{gain}(S, T) = H(T) - \sum_{i=1}^{k} \frac{|T_i|}{|T|} H(T_i) \]
  • Strategy to build minimal tree:
    Choose attribute \(S\) resulting in greatest information gain

Tennis data set – Sorted by "outlook"

play outlook temp humidity windy
yes overcast hot high false
yes overcast cool normal true
yes overcast mild high true
yes overcast hot normal false
no rainy cool normal true
no rainy mild high true
yes rainy mild high false
yes rainy cool normal false
yes rainy mild normal false
no sunny hot high false
no sunny hot high true
no sunny mild high false
yes sunny cool normal false
yes sunny mild normal true

Information gain for variable "outlook"

Let’s choose \(S=\mbox{outlook}\) and compute \(\mbox{gain}(\mbox{outlook},T)\): \[ \begin{eqnarray} \mbox{gain}(\mbox{outlook},T) & = & H(T) - \sum_{i \in \{\textsf{sunny,overcast,rainy}\} } \frac{|T_i|}{|T|} H(T_i) \nonumber \\ & = & H(T) - \\ & & \Bigl( \frac{5}{14} \times \Bigl( -\Bigl(\frac{2}{5} \times \log_2 \frac{2}{5} + \frac{3}{5} \times \log_2 \frac{3}{5} \Bigr) \Bigr) + \nonumber \\ & & \ \ \frac{4}{14} \times \Bigl( -\Bigl( \frac{0}{4} \times \log_2 \frac{0}{4} + \frac{4}{4} \times \log_2 \frac{4}{4} \Bigr) \Bigr) + \nonumber \\ & & \ \ \frac{5}{14} \times \Bigl( -\Bigl( \frac{3}{5} \times \log_2 \frac{3}{5} + \frac{2}{5} \times \log_2 \frac{2}{5} \Bigr) \Bigr) \Bigr) \nonumber \\ & = & 0.94 - (0.347 + 0 + 0.347) \nonumber \\ & = & 0.247 \nonumber \end{eqnarray} \]

Information gain for other variables

Compute gain for all attributes:

\(\mbox{gain}(\mbox{outlook}, T) = 0.247\)

\(\mbox{gain}(\mbox{temp}, T) = 0.029\)

\(\mbox{gain}(\mbox{humidity}, T ) = 0.152\)

\(\mbox{gain}(\mbox{windy}, T ) = 0.048\)

→ Choose attribute corresponding to maximum gain for splitting: "outlook"
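All four gains can be reproduced in a few lines from the tennis table above:

```python
import math
from collections import Counter

# Tennis data set: (outlook, temp, humidity, windy, play)
data = [
    ("sunny", "hot", "high", False, "no"),
    ("sunny", "hot", "high", True, "no"),
    ("overcast", "hot", "high", False, "yes"),
    ("rainy", "mild", "high", False, "yes"),
    ("rainy", "cool", "normal", False, "yes"),
    ("rainy", "cool", "normal", True, "no"),
    ("overcast", "cool", "normal", True, "yes"),
    ("sunny", "mild", "high", False, "no"),
    ("sunny", "cool", "normal", False, "yes"),
    ("rainy", "mild", "normal", False, "yes"),
    ("sunny", "mild", "normal", True, "yes"),
    ("overcast", "mild", "high", True, "yes"),
    ("overcast", "hot", "normal", False, "yes"),
    ("rainy", "mild", "high", True, "no"),
]

def entropy(labels):
    """H estimated from the relative frequencies of the labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gain(attr_index):
    """gain(S, T) = H(T) - sum_i |T_i|/|T| * H(T_i)."""
    labels = [row[-1] for row in data]
    total = entropy(labels)
    for value in set(row[attr_index] for row in data):
        subset = [row[-1] for row in data if row[attr_index] == value]
        total -= len(subset) / len(data) * entropy(subset)
    return total

gains = {name: round(gain(i), 3)
         for i, name in enumerate(["outlook", "temp", "humidity", "windy"])}
```

Running this yields exactly the gains listed above, with "outlook" as the maximum.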

Continuing the split

Final tree
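Assuming the splitting continues as described (outlook at the root, with humidity and windy splitting the sunny and rainy branches, the standard result for this data set), the final tree can be encoded as nested dicts in the representation described earlier: each internal node tests an attribute, each branch is an attribute value, each leaf is a class label.

```python
# Final tree for the tennis example as nested dicts (assumption: the
# standard ID3 result for this data set)
tree = {"outlook": {
    "sunny": {"humidity": {"high": "no", "normal": "yes"}},
    "overcast": "yes",
    "rainy": {"windy": {True: "no", False: "yes"}},
}}

def classify(tree, record):
    """Walk the tree: test attributes until a leaf label is reached."""
    if not isinstance(tree, dict):   # reached a leaf → class label
        return tree
    attr = next(iter(tree))          # attribute tested at this node
    return classify(tree[attr][record[attr]], record)
```

For example, `classify(tree, {"outlook": "sunny", "humidity": "normal"})` returns `"yes"`.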

Few notes on decision trees

  • Algorithm can create over-complex trees that do not generalize very well
    → Mechanisms such as pruning are necessary to avoid this problem
  • Trees can be extended to process numerical and missing attributes
  • Trees can also solve regression problems
  • Construction of trees is computationally very efficient
    → Popular choice for boosting and bagging approaches

Summary

  • Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from data
  • Machine learning builds computer programs that improve their performance on a task through experience (data)
  • Classification, regression, association rules, clustering and recommendations are typical ML problems
  • Decision trees as an example machine learning algorithm

Questions?